AITopics | vision-centric task

Collaborating Authors

vision-centric task

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks Wenhai Wang 2 Zhe Chen 1,3 Xiaokang Chen 1,4 Jiannan Wu

Neural Information Processing SystemsFeb-16-2026, 22:29:37 GMT

It's noteworthy that, with a generalist LLMbased framework, our model can achieve over 60% mAP on COCO, on par with

large language model, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > China > Hong Kong (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models Y ang Jiao

Neural Information Processing SystemsFeb-16-2026, 18:00:10 GMT

The current methods follow the paradigm of adapting the visual task outputs to language-oriented formats.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > China > Shanghai > Shanghai (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(2 more...)

Add feedback

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Neural Information Processing SystemsDec-26-2025, 16:56:33 GMT

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The code shall be released.

language model, open-ended decoder, visionllm, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

946ecab300b0695fe24b53a92e632935-Paper-Conference.pdf

Neural Information Processing SystemsOct-10-2025, 10:08:13 GMT

arxiv preprint arxiv, instruction, language model, (14 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > China > Shanghai > Shanghai (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(2 more...)

Add feedback

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks Wenhai Wang 2 Zhe Chen 1,3 Xiaokang Chen 1,4 Jiannan Wu

Neural Information Processing SystemsOct-9-2025, 06:37:42 GMT

It's noteworthy that, with a generalist LLMbased framework, our model can achieve over 60% mAP on COCO, on par with

large language model, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > China > Hong Kong (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models

Zheng, Weiying, Lin, Ziyue, Guo, Pengxin, Zhou, Yuyin, Wang, Feifei, Qu, Liangqiong

arXiv.org Artificial IntelligenceJun-12-2025

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in cross-modal understanding and generation by integrating visual and textual information. While instruction tuning and parameter-efficient fine-tuning methods have substantially improved the generalization of VLMs, most existing approaches rely on centralized training, posing challenges for deployment in domains with strict privacy requirements like healthcare. Recent efforts have introduced Federated Learning (FL) into VLM fine-tuning to address these privacy concerns, yet comprehensive benchmarks for evaluating federated fine-tuning strategies, model architectures, and task generalization remain lacking. In this work, we present \textbf{FedVLMBench}, the first systematic benchmark for federated fine-tuning of VLMs. FedVLMBench integrates two mainstream VLM architectures (encoder-based and encoder-free), four fine-tuning strategies, five FL algorithms, six multimodal datasets spanning four cross-domain single-task scenarios and two cross-domain multitask settings, covering four distinct downstream task categories. Through extensive experiments, we uncover key insights into the interplay between VLM architectures, fine-tuning strategies, data heterogeneity, and multi-task federated optimization. Notably, we find that a 2-layer multilayer perceptron (MLP) connector with concurrent connector and LLM tuning emerges as the optimal configuration for encoder-based VLMs in FL. Furthermore, current FL methods exhibit significantly higher sensitivity to data heterogeneity in vision-centric tasks than text-centric ones, across both encoder-free and encoder-based VLM architectures. Our benchmark provides essential tools, datasets, and empirical guidance for the research community, offering a standardized platform to advance privacy-preserving, federated training of multimodal foundation models.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2506.09638

Country: Asia (0.46)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.54)

Add feedback

Hidden in plain sight: VLMs overlook their visual representations

Fu, Stephanie, Bonnen, Tyler, Guillory, Devin, Darrell, Trevor

arXiv.org Artificial IntelligenceJun-10-2025

Language provides a natural interface to specify and evaluate performance on visual tasks. To realize this possibility, vision language models (VLMs) must successfully integrate visual and linguistic information. Our work compares VLMs to a direct readout of their visual encoders to understand their ability to integrate across these modalities. Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance. We investigate these results through a series of analyses across the entire VLM: namely 1) the degradation of vision representations, 2) brittleness to task prompt, and 3) the language model's role in solving the task. We find that the bottleneck in performing these vision-centric tasks lies in this third category; VLMs are not effectively using visual information easily accessible throughout the entire model, and they inherit the language priors present in the LLM. Our work helps diagnose the failure modes of open-source VLMs, and presents a series of evaluations useful for future investigations into visual understanding within VLMs.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.08008

Country: North America (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception

Liao, Yuan-Hong, Elflein, Sven, He, Liu, Leal-Taixé, Laura, Choi, Yejin, Fidler, Sanja, Acuna, David

arXiv.org Artificial IntelligenceApr-23-2025

Recent reasoning models through test-time scaling have demonstrated that long chain-of-thoughts can unlock substantial performance boosts in hard reasoning tasks such as math and code. However, the benefit of such long thoughts for system-2 reasoning is relatively less explored in other domains such as perceptual tasks where shallower, system-1 reasoning seems sufficient. In this paper, we introduce LongPerceptualThoughts, a new synthetic dataset with 30K long-thought traces for perceptual tasks. The key challenges in synthesizing elaborate reasoning thoughts for perceptual tasks are that off-the-shelf models are not yet equipped with such thinking behavior and that it is not straightforward to build a reliable process verifier for perceptual tasks. Thus, we propose a novel three-stage data synthesis framework that first synthesizes verifiable multiple-choice questions from dense image descriptions, then extracts simple CoTs from VLMs for those verifiable problems, and finally expands those simple thoughts to elaborate long thoughts via frontier reasoning models. In controlled experiments with a strong instruction-tuned 7B model, we demonstrate notable improvements over existing visual reasoning data-generation methods. Our model, trained on the generated dataset, achieves an average +3.4 points improvement over 5 vision-centric benchmarks, including +11.8 points on V$^*$ Bench. Notably, despite being tuned for vision tasks, it also improves performance on the text reasoning benchmark, MMLU-Pro, by +2 points.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2504.15362

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Industry: Education (0.35)

Technology: